Extracting data
Sep. 14, 2021
This page provides an overview of the Crawler’s extraction process. It covers how pages are selected and processed, and how records are extracted from those pages.

Processing a page

To understand extraction, it helps to first understand how the Crawler processes pages. Pages are processed in five main steps:

1. A page is fetched.
2. Links and records are extracted from the page.
3. The extracted records are indexed to Algolia.
4. The extracted links are added to the Crawler’s URL database.
5. For each new, non-excluded page added to the database, the process is repeated.

Adding a page

When a crawl starts, your crawler adds all the URLs stored in the following parameters to its URL database:

- startUrls
- sitemaps
- extraUrls

For each of these pages, your crawler fetches linked pages. It looks for links in any of the following formats:

- head > link[rel=alternate]
- a[href]
- iframe[src]
- area[href]
- head > link[rel=canonical]
- redirect target when the HTTP code is 301 or 302

However, not every matching link is added: there are a number of reasons why a page might be skipped or ignored. If a page is not ignored, its content is extracted.

Extracting records

Records are extracted from pages by a recordExtractor. Extractors are assigned to actions via the recordExtractor parameter. This parameter links to a function that returns the data you want to index, organized in an array of JSON objects.

Anatomy of a recordExtractor

```js
recordExtractor: ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      title: $("head > title").text(),
      description: $("meta[name=description]").attr("content"),
      type: $('meta[property="og:type"]').attr("content"),
    }
  ];
}
```

Extraction function

recordExtractor is a custom function that takes a website’s metadata, HTML (and potentially external data), and returns an array of JSON objects.
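To get a feel for what such a function receives and returns, it can help to call an extractor by hand outside the Crawler. The sketch below is only a local dry run under stated assumptions: `makeFakeCheerio` is a hypothetical stand-in that mimics just the `.text()` and `.attr()` calls used in the anatomy example, and the page content is made up. It is not how the Crawler itself invokes extractors.

```javascript
// Hypothetical stand-in for the Cheerio instance: it only knows the
// selectors listed in `page` and supports just .text() and .attr().
function makeFakeCheerio(page) {
  return (selector) => {
    const node = page[selector] || {};
    return {
      text: () => node.text || "",
      attr: (name) => (node.attrs || {})[name],
    };
  };
}

// Same extractor as in the anatomy example, written as a standalone function.
const recordExtractor = ({ url, $, contentLength, fileType }) => {
  return [
    {
      url: url.href,
      title: $("head > title").text(),
      description: $("meta[name=description]").attr("content"),
      type: $('meta[property="og:type"]').attr("content"),
    },
  ];
};

// Made-up page content for the dry run.
const $ = makeFakeCheerio({
  "head > title": { text: "Example page" },
  "meta[name=description]": { attrs: { content: "A page used for testing" } },
  'meta[property="og:type"]': { attrs: { content: "article" } },
});

const records = recordExtractor({
  url: new URL("https://example.com/docs/page"),
  $,
  contentLength: 1234,
  fileType: "html",
});

// Each object in the returned array would become one record.
console.log(JSON.stringify(records, null, 2));
```

A dry run like this only checks the shape of the returned records. Selector behavior on real pages still has to be verified against the actual Cheerio instance.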
Parameters

This function receives an object with several properties to help you build your final records:

- $: a Cheerio instance that contains the crawled website’s content (see the “Extracting a site’s content” section below).
- url: a Location object that contains the URL of the page being crawled.
- fileType: the file type of the webpage (html, pdf, etc.).
- contentLength: the length of the webpage’s content.
- dataSources: the external data sets that you’ve declared in your crawler and want to combine with your extraction data.
- helpers: a collection of functions to help you extract content and generate records.

url, fileType, and contentLength provide useful metadata on the page you are crawling. However, to extract content from your webpages, you need to use the Cheerio instance ($).

Return structure

Each JSON object returned by your recordExtractor is converted directly into a record in your Algolia index. The objects can contain any type of value, as long as they are compatible with an Algolia record. However, each object must be smaller than 500 KB, and you can return a maximum of 200 records per crawled URL.

Extracting a site’s content

Website content is accessible through a recordExtractor’s Cheerio instance ($) parameter. Cheerio is “a lean implementation of core jQuery designed specifically for the server”. Check out Cheerio’s documentation for examples, syntax, and guidance.

Guides:

- How to configure your first crawler
- Extracting data with Cheerio

Extracting from JavaScript-based sites

You can also use your crawler on JavaScript-based websites. To do this, set renderJavaScript to true in your crawler’s configuration file. Setting renderJavaScript to true makes the crawling process a lot slower, so you can enable it for only a subset of your website.

Extracting data from non-HTML documents

You can use the Crawler to index documents (such as PDF and DOC files). Documents are transformed into HTML by a dedicated Tika Server.
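Long documents can yield far more text than fits comfortably in one record, so the limits from the return structure (under 500 KB per record, at most 200 records per URL) are worth keeping in mind here. The following is a minimal sketch of one way to split text into several records. `splitIntoRecords`, the `part` and `content` field names, and the 5,000-character chunk size are all hypothetical choices for illustration and are not part of the Crawler’s API.

```javascript
// Limit stated in the "Return structure" section.
const MAX_RECORDS_PER_URL = 200;

// Hypothetical chunk size, chosen to stay far below the 500 KB record limit.
// It counts characters, not bytes, hence the wide safety margin.
const MAX_CHARS_PER_RECORD = 5000;

// Hypothetical helper: split a long text into several smaller records.
function splitIntoRecords(href, text, maxChars = MAX_CHARS_PER_RECORD) {
  const records = [];
  for (let start = 0; start < text.length; start += maxChars) {
    // Stop once the per-URL cap is reached; the remaining text is dropped.
    if (records.length === MAX_RECORDS_PER_URL) break;
    records.push({
      url: href,
      part: records.length, // position of this chunk within the document
      content: text.slice(start, start + maxChars),
    });
  }
  return records;
}

// Inside a recordExtractor, the text might come from something like
// $("body").text() (an assumption about the converted document's markup).
const longText = "lorem ipsum ".repeat(1000); // 12,000 characters
const parts = splitIntoRecords("https://example.com/files/report.pdf", longText);

console.log(parts.length, parts.map((record) => record.content.length));
```

Splitting on a fixed character count can cut sentences in half. Splitting on headings or paragraphs usually gives more useful records, at the cost of a little more extraction logic.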
How to: Indexing non-HTML documents